
[Figure 2.11 is a bar chart of AP on VOC as modules of DETR-R50 are progressively quantized: (1) backbone, (2) encoder, (3) MHA of decoder, (4) MLPs. 4-bit DETR-R50: 83.3 (real-valued), 82.2 with (1), 81.1 with (1)+(2) (−1.1), 79.3 with (1)+(2)+(3) (−1.8), 78.8 with (1)+(2)+(3)+(4) (−0.5). 3-bit DETR-R50: 83.3 (real-valued), 80.1, 79.3 (−0.8), 77.2 (−2.1), 76.8 (−0.4).]

FIGURE 2.11
Performance of 3/4-bit quantized DETR-R50 on VOC with different quantized modules.

$2^{a-1}-1$, $Q^w_n = -2^{b-1}$, and $Q^w_p = 2^{b-1}-1$ are the discrete bounds for $a$-bit activations and $b$-bit weights. $x$ generally denotes the activation in this paper, including the input feature maps of convolution and fully-connected layers and the inputs of multi-head attention modules. Based on this, we first give the quantized fully-connected layer as:

$$\text{Q-FC}(x) = Q_a(x) \cdot Q_w(w) = \alpha_x \alpha_w \Big( x_q \odot w_q + \frac{z}{\alpha_x} \odot w_q \Big), \qquad (2.25)$$

where $\cdot$ denotes matrix multiplication and $\odot$ denotes matrix multiplication with efficient bit-wise operations. The straight-through estimator (STE) [9] is used to approximate the gradient of the non-differentiable rounding during backward propagation.
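For concreteness, the quantizer and the quantized fully-connected layer of Eq. (2.25) can be sketched in PyTorch as follows. This is only an illustrative sketch, not the reference implementation: the names FakeQuantSTE and QFC, the parameters a_bits, w_bits, alpha_x, alpha_w, and z, and their default values are assumptions, and the bit-wise integer kernels used at deployment are merely emulated in floating point.

```python
import torch
import torch.nn.functional as F


class FakeQuantSTE(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator (STE).

    Sketch only: x is quantized to integers in [qn, qp] with step size alpha
    and zero point z, then dequantized; the backward pass sends the gradient
    straight through the round/clip.
    """

    @staticmethod
    def forward(ctx, x, alpha, z, qn, qp):
        x_q = torch.clamp(torch.round((x - z) / alpha), qn, qp)
        return alpha * x_q + z  # fake-quantized (dequantized) value

    @staticmethod
    def backward(ctx, grad_output):
        # Simplest STE: gradient w.r.t. x only; alpha and z are treated as fixed here.
        return grad_output, None, None, None, None


class QFC(torch.nn.Module):
    """Quantized fully-connected layer in the spirit of Eq. (2.25).

    Hypothetical module: a_bits/w_bits set the bit-widths, giving bounds
    Qn = -2**(bits - 1) and Qp = 2**(bits - 1) - 1. The integer product of
    x_q and w_q, which would use bit-wise kernels at inference, is emulated
    with floating-point fake quantization.
    """

    def __init__(self, in_features, out_features, a_bits=4, w_bits=4):
        super().__init__()
        self.weight = torch.nn.Parameter(0.02 * torch.randn(out_features, in_features))
        self.alpha_x = torch.nn.Parameter(torch.tensor(0.1))   # activation step size
        self.alpha_w = torch.nn.Parameter(torch.tensor(0.01))  # weight step size
        self.z = torch.nn.Parameter(torch.tensor(0.0))         # activation zero point
        self.a_qn, self.a_qp = -2 ** (a_bits - 1), 2 ** (a_bits - 1) - 1
        self.w_qn, self.w_qp = -2 ** (w_bits - 1), 2 ** (w_bits - 1) - 1

    def forward(self, x):
        x_hat = FakeQuantSTE.apply(x, self.alpha_x, self.z, self.a_qn, self.a_qp)
        w_hat = FakeQuantSTE.apply(self.weight, self.alpha_w, 0.0, self.w_qn, self.w_qp)
        return F.linear(x_hat, w_hat)  # Q_a(x) · Q_w(w)
```

In this sketch the zero point enters exactly as the $z/\alpha_x$ term of Eq. (2.25): $Q_a(x) \approx \alpha_x x_q + z = \alpha_x (x_q + z/\alpha_x)$, so its product with $Q_w(w) = \alpha_w w_q$ expands to $\alpha_x \alpha_w \big( x_q \odot w_q + (z/\alpha_x) \odot w_q \big)$.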

In DETR [31], the visual features generated by the backbone are augmented with position embedding and fed into the transformer encoder. Given an encoder output $E$, DETR performs co-attention between the object queries $O$ and the visual features $E$, which is formulated as:

$$q = \text{Q-FC}(O), \qquad k, v = \text{Q-FC}(E),$$
$$A_i = \operatorname{softmax}\big( Q_a(q)_i \cdot Q_a(k)_i^{\top} / \sqrt{d} \big),$$
$$D_i = Q_a(A)_i \cdot Q_a(v)_i, \qquad (2.26)$$

where $D$ is the output of the multi-head co-attention module, i.e., the co-attended feature for the object query, and $d$ denotes the feature dimension of each head. Additional FC layers transform the decoder's output feature of each object query into the final output. Given the box and class predictions, the Hungarian algorithm [31] is applied between the predictions and the ground-truth box annotations to identify the learning target of each object query.
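The quantized co-attention of Eq. (2.26) can be sketched similarly. The function quantized_co_attention below, its separate Q-FC projections qfc_q, qfc_k, qfc_v, and the activation quantizer q_act standing in for $Q_a$ are illustrative assumptions rather than DETR's or Q-DETR's actual interface.

```python
import math
import torch


def quantized_co_attention(O, E, qfc_q, qfc_k, qfc_v, q_act, num_heads=8):
    """Sketch of the quantized multi-head co-attention of Eq. (2.26).

    O: object queries, shape (num_queries, d_model).
    E: encoder output,  shape (num_tokens,  d_model).
    qfc_q / qfc_k / qfc_v: quantized FC projections (e.g. the QFC sketch above).
    q_act: activation quantizer Q_a applied to q, k, A, and v.
    """
    num_queries, d_model = O.shape
    d = d_model // num_heads  # per-head feature dimension

    # q = Q-FC(O); k, v = Q-FC(E)
    q, k, v = qfc_q(O), qfc_k(E), qfc_v(E)

    def split_heads(x):
        # (seq, d_model) -> (num_heads, seq, d)
        return x.reshape(x.shape[0], num_heads, d).transpose(0, 1)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # A_i = softmax(Q_a(q)_i · Q_a(k)_i^T / sqrt(d))
    A = torch.softmax(q_act(q) @ q_act(k).transpose(-2, -1) / math.sqrt(d), dim=-1)

    # D_i = Q_a(A)_i · Q_a(v)_i  -- the co-attended feature per head
    D = q_act(A) @ q_act(v)

    # Merge heads back to (num_queries, d_model)
    return D.transpose(0, 1).reshape(num_queries, d_model)
```

With the sketches above one could, for example, set q_act = lambda t: FakeQuantSTE.apply(t, torch.tensor(0.05), 0.0, -8, 7) for 4-bit activations, so that q, k, the attention map A, and v are all fake-quantized before the two matrix products, mirroring Eq. (2.26).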

2.4.2 Challenge Analysis

Intuitively, the performance of the quantized DETR baseline largely depends on its information representation capability, which is mainly reflected by the information carried in the multi-head attention modules. Unfortunately, this information is severely degraded by the quantized weights and inputs in the forward pass. In addition, the rounded and discrete quantization significantly affects optimization during backpropagation.

We conduct quantitative ablation experiments by progressively replacing each module of the real-valued DETR baseline with its quantized counterpart and compare the average precision (AP) drop on the VOC dataset [62], as shown in Fig. 2.11. We find that quantizing the MHA